Conversation

@guitargeek (Contributor) commented Sep 1, 2025

`RecurseCountsThreadLocal` makes the assumption that only two instances are ever created per process. Its implementation can therefore be optimized to the point that it should be as fast as or faster than `RecurseCountsTBBUnique`, removing Core's dependence on TBB without compromising performance.
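
For illustration, here is a minimal sketch of the idea (hypothetical names and layout, not the actual ROOT implementation): because at most two instances exist per process, each instance can claim a fixed slot in a plain `thread_local` array, so the per-thread lookup needs no `tbb::enumerable_thread_specific`.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical sketch, not the ROOT code: per-thread counters for up to
// two lock instances, stored in a fixed thread_local array.
struct LocalCounts {
   std::size_t fReadersCount = 0;
   bool fIsWriter = false;
};

class RecurseCountsThreadLocalSketch {
   const unsigned fSlot; // 0 or 1, assigned once per instance

   static unsigned NextSlot()
   {
      static std::atomic<unsigned> counter{0};
      return counter.fetch_add(1); // assumed to be reached at most twice per process
   }

   static LocalCounts &Local(unsigned slot)
   {
      // One pair of counters per thread and per instance; the lookup is a
      // TLS access plus an array index, with no TBB machinery involved.
      thread_local LocalCounts gCounts[2];
      return gCounts[slot];
   }

public:
   RecurseCountsThreadLocalSketch() : fSlot(NextSlot()) {}

   void IncrementReadCount() { ++Local(fSlot).fReadersCount; }
   void DecrementReadCount() { --Local(fSlot).fReadersCount; }
   bool IsCurrentWriter() { return Local(fSlot).fIsWriter; }
};
```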

@guitargeek self-assigned this Sep 1, 2025
@guitargeek requested a review from dpiparo as a code owner September 1, 2025 23:43
github-actions bot commented Sep 2, 2025

Test Results

16 files, 16 suites, 2d 13h 47m 55s ⏱️
 3 618 tests:  3 617 ✅, 0 💤, 1 ❌
57 538 runs:  57 537 ✅, 0 💤, 1 ❌

For more details on these failures, see this check.

Results for commit d7797bc.

♻️ This comment has been updated with latest results.

@guitargeek requested a review from pcanal as a code owner September 2, 2025 11:58
@guitargeek changed the title from "[ci] Check what happens when building with tbb=ON" to "[core] Implement RecurseCountsThreadLocal to be used in gCoreMutex" Sep 2, 2025
@bendavid (Contributor) commented Sep 2, 2025

I'm confused, doesn't RecurseCountsThreadLocal still depend on tbb::enumerable_thread_specific in this case?

As described in root-project#19798, we have an old build option that was supposedly removed, but it is still considered in the implementation of `ROOT::gCoreMutex`.

It would be good to know what happens when testing ROOT with it (no differences are expected, since the performance optimization with TBB was actually always enabled when building with `builtin_tbb=ON` as well, as we do for macOS and Windows).
@guitargeek (Contributor, Author)
Oh, pardon, I forgot to remove the `tbb::enumerable_thread_specific<LocalCounts> fLocalCounts` data member! It's updated now.

@bendavid (Contributor) commented Sep 2, 2025

So there's a possibly longstanding issue here, related to bd1894b and discussed in #6919 (comment).

Given that this is similar to the old implementation in:

```cpp
struct UniqueLockRecurseCount {
   using Hint_t = TVirtualRWMutex::Hint_t;

   struct LocalCounts {
      size_t fReadersCount = 0;
      bool fIsWriter = false;
   };
   size_t fWriteRecurse = 0; ///<! Number of re-entry in the lock by the same thread.

   UniqueLockRecurseCount();

   using local_t = LocalCounts *;

   local_t GetLocal()
   {
      TTHREAD_TLS_DECL(LocalCounts, gLocal);
      return &gLocal;
   }

   Hint_t *IncrementReadCount(local_t &local)
   {
      ++(local->fReadersCount);
      return reinterpret_cast<TVirtualRWMutex::Hint_t *>(&(local->fReadersCount));
   }

   template <typename MutexT>
   Hint_t *IncrementReadCount(local_t &local, MutexT &)
   {
      return IncrementReadCount(local);
   }

   Hint_t *DecrementReadCount(local_t &local)
   {
      --(local->fReadersCount);
      return reinterpret_cast<TVirtualRWMutex::Hint_t *>(&(local->fReadersCount));
   }

   template <typename MutexT>
   Hint_t *DecrementReadCount(local_t &local, MutexT &)
   {
      return DecrementReadCount(local);
   }

   void ResetReadCount(local_t &local, int newvalue) { local->fReadersCount = newvalue; }

   bool IsCurrentWriter(local_t &local) { return local->fIsWriter; }
   bool IsNotCurrentWriter(local_t &local) { return !local->fIsWriter; }

   void SetIsWriter(local_t &local)
   {
      // if (fWriteRecurse == std::numeric_limits<decltype(fWriteRecurse)>::max()) {
      //    ::Fatal("TRWSpinLock::WriteLock", "Too many recursions in TRWSpinLock!");
      // }
      ++fWriteRecurse;
      local->fIsWriter = true;
   }

   void DecrementWriteCount() { --fWriteRecurse; }
   void ResetIsWriter(local_t &local) { local->fIsWriter = false; }

   size_t &GetLocalReadersCount(local_t &local) { return local->fReadersCount; }
};
```

@guitargeek (Contributor, Author)
Oh! I see. Interesting.

@guitargeek (Contributor, Author)
I wonder how TBB avoids this problem that Philippe described...

@guitargeek marked this pull request as draft September 2, 2025 12:52
@pcanal (Member) commented Sep 2, 2025

> I wonder how TBB avoids this problem that Philippe described...

The difference between `UniqueLockRecurseCount` and `RecurseCounts` (and likely the TBB implementation) was switching from using `thread_local` (which, if I recall correctly, requires/required, via a call to `tls_get_addr_tail`, indirectly taking the same lock as ... something else, maybe related to library loading or more likely static initialization) to only using `std::this_thread::get_id()`.
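
As a sketch of the two lookup strategies being contrasted here (neither is the actual ROOT or TBB code, and the map-based variant assumes the caller provides the synchronization that the real lock already has):

```cpp
#include <cstddef>
#include <map>
#include <thread>

// Old-style lookup: a function-local thread_local. Its first use on a
// given thread can go through the dynamic-TLS machinery (__tls_get_addr
// and friends), which for dynamically loaded libraries may touch a
// runtime-loader lock.
std::size_t &CounterViaThreadLocal()
{
   thread_local std::size_t gCount = 0;
   return gCount;
}

// RecurseCounts-style lookup: key the counters by std::this_thread::get_id()
// instead, so no dynamic TLS is involved. This sketch assumes the map is
// only touched while the surrounding lock is held.
std::map<std::thread::id, std::size_t> gCountsById;

std::size_t &CounterViaThreadId()
{
   return gCountsById[std::this_thread::get_id()];
}
```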

The real challenge with this area is that seeing it work in most (all?) cases does not tell us we don't have the problem, as I was unable at the time to deterministically reproduce the issue (see the log in bd1894b).

At the very least, one should revert bd1894b (i.e. go back to using the old RecurseCounts) to check whether we see the problem. If we see the problem, then we have some degree of confidence that the alternative code not failing means it might not have the same problem. If we do not see the problem, then ... we cannot tell :(

@pcanal (Member) commented Sep 2, 2025

In tlslock.tar.gz I captured the spirit of what I remember:

```
. go
Starting execution
Start thread 1 workload
Start thread 2 workload
Thread 2 sleeps 2 seconds while holding the lock
Inner most counter for thread 0x16f7eb000 is 1
Thread 1 will request the lock
Thread 2 done sleeping and will initialize a thread_local
# Process is stuck forever.
```

on macOS and Linux. The process is actually stuck in the ROOT lock ...

Updating the example by commenting out `ROOT::EnableThreadSafety();` in main.cxx shows a slightly different behavior, but then again it is not using gSystem in a thread-safe way.

Thread 2 is waiting for the lock:

```
#0  0x00007ffff3e873f0 in __lll_lock_wait () from /lib64/libc.so.6
#1  0x00007ffff3e8d582 in pthread_mutex_lock@@GLIBC_2.2.5 () from /lib64/libc.so.6
#2  0x0000000000404505 in __gthread_mutex_lock (__mutex=0x4081c0 <gMainMutex>) at /cvmfs/cms.cern.ch/el9_amd64_gcc11/external/gcc/11.4.1-30ebdc301ebd200f2ae0e3d880258e65/include/c++/11.4.1/x86_64-redhat-linux-gnu/bits/gthr-default.h:749
#3  0x0000000000404634 in std::mutex::lock (this=0x4081c0 <gMainMutex>) at /cvmfs/cms.cern.ch/el9_amd64_gcc11/external/gcc/11.4.1-30ebdc301ebd200f2ae0e3d880258e65/include/c++/11.4.1/bits/std_mutex.h:100
#4  0x00007ffff45374ad in innermost () at inner.cxx:21
```

while the other thread is waiting on the system lock:

```
(gdb) thread 3
[Switching to thread 3 (LWP 4021866)]
#0  0x00007ffff3e873f0 in __lll_lock_wait () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff3e873f0 in __lll_lock_wait () from /lib64/libc.so.6
#1  0x00007ffff3e8d5dd in pthread_mutex_lock@@GLIBC_2.2.5 () from /lib64/libc.so.6
#2  0x00007ffff7fd3393 in _dl_open (file=0x7fffcc000c40 "/var/tmp/tlslock/libWithThreadLocal.so", mode=-2147483391,
    caller_dlopen=0x7fffe4cbf2be <cling::utils::platform::DLOpen(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+30>,
    nsid=-2, argc=1, argv=0x7fffffffbc18, env=0x7fffec0226f0) at dl-open.c:819
```

It is not clear whether the example is too convoluted.

@pcanal (Member) commented Sep 3, 2025

This second example also shows the essence of the problem without ROOT: tlslock2.tar.gz

Side note: I am guessing the example can be simplified further (by removing one of the layers of libraries).
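
The pattern in those tarballs boils down to a lock-order inversion. A minimal sketch of that inversion, with the dynamic loader's internal lock modeled by a plain `std::mutex` (the real case involves `dlopen` and lazy `thread_local` initialization instead, so this only shows the shape of the deadlock, and the program hangs by construction):

```cpp
#include <chrono>
#include <mutex>
#include <thread>

std::mutex gAppLock;    // stands in for ROOT's gCoreMutex
std::mutex gLoaderLock; // stands in for the dynamic loader's internal lock

int main()
{
   // Thread 1: takes the application lock, then first-touches a
   // thread_local from a shared library, which needs the loader lock.
   std::thread t1([] {
      std::lock_guard<std::mutex> app(gAppLock);
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
      std::lock_guard<std::mutex> loader(gLoaderLock); // blocks: t2 holds it
   });

   // Thread 2: inside dlopen (holding the loader lock), runs code such as
   // a static initializer that needs the application lock.
   std::thread t2([] {
      std::lock_guard<std::mutex> loader(gLoaderLock);
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
      std::lock_guard<std::mutex> app(gAppLock); // blocks: t1 holds it
   });

   t1.join();
   t2.join(); // never reached: a classic ABBA deadlock
}
```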

@bendavid (Contributor) commented Sep 3, 2025

One thing I'm not sure I understood here: would it be sufficient to ensure that the thread_local variables are initialized during the loading of libThread (and/or at the subsequent creation of new threads), as opposed to potentially later, at the first call to GetLocal()?

@pcanal (Member) commented Sep 3, 2025

> Would it be sufficient to ensure that the thread_local variables are initialized during the loading of libThread

It cannot be done, can it? The thread_local variables officially need to be initialized for each (new) thread.

@bendavid (Contributor) commented Sep 3, 2025

Yes, but in this case they could/should be initialized for all existing threads at library load time, and for any additional threads immediately when each thread is created.

(As opposed to when execution reaches some particular local scope, I think, as in https://en.cppreference.com/w/cpp/language/storage_duration.html#Static_block_variables)

@pcanal (Member) commented Sep 3, 2025

> Yes, but in this case it could/should be initialized for all existing threads

I can see how that would work for a static variable ... but I am confused about how it could work for a thread_local, since there is an infinite (indeterminate) number of (upcoming) threads.

@bendavid (Contributor) commented Sep 3, 2025

Right, for threads created after the library is loaded, I would assume there are some cases where the thread_local variables would be initialized immediately upon the creation of the thread (but not necessarily for the thread_local block-variable case I linked above).

@pcanal (Member) commented Sep 3, 2025

> Right, for threads created after the library is loaded I would assume there are some cases where the thread_local variables would be initialized immediately upon the creation of the thread. (but not necessarily for the thread_local block variable case I linked above)

Humm ... intriguing ... We might indeed be able to use a `thread_local` declared at global scope rather than inside a function in order to avoid the issue.

@bendavid (Contributor) commented Sep 3, 2025

Yes, or possibly as a static thread_local class data member.
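
A sketch of the alternatives discussed in the last few comments (simplified names; whether the lazy dynamic-TLS path is fully avoided for the namespace-scope and static-member variants is implementation- and ABI-dependent, so this shows the idea rather than a guarantee):

```cpp
#include <cstddef>

struct LocalCounts {
   std::size_t fReadersCount = 0;
   bool fIsWriter = false;
};

// Block-scope variant (as in the old GetLocal()): initialized the first
// time control passes through the declaration, i.e. potentially while some
// unrelated lock is already held, which is the problematic case above.
LocalCounts *GetLocalBlockScope()
{
   thread_local LocalCounts gLocal;
   return &gLocal;
}

// Namespace-scope variant: with trivial/constant initialization, the
// implementation can set the storage up when the thread or library comes
// to life rather than at first use.
thread_local LocalCounts gLocalCounts;

// Static thread_local class data member, as suggested in the last comment.
struct RecurseCountsSketch {
   static thread_local LocalCounts fgLocal;
};
thread_local LocalCounts RecurseCountsSketch::fgLocal;
```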
